perm filename FORMAN[KI,ALS] blob
sn#097064 filedate 1974-04-14 generic text, type T, neo UTF8
00100 The Stanford AI Pitch-Synchronous Fourier-Transform Formant Extractor
00200
00300 The formant extractor is not a formant tracker in the usual sense
00400 since a fresh determination of the formant locations is made for each
00500 segment independently. This is thought to be desirable as it reveals
00600 rapid changes in formant location, particularly in the vicinity of
00700 obstruents where the character of the obstruent is frequently
00800 revealed more by these rapid transitions than by anything else. Only
00900 after this has been done is any attempt made to reconcile data for
01000 adjacent segments, as will be explained later.
01100
01200 Formant identification is based on the use of Fourier transforms
01300 using single pitch period segments where the segment starts and ends
01400 at the zero crossing which precedes the maximum excursion in
01500 amplitude.
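The segment selection and transform just described can be sketched as follows. This is a minimal illustration and not the original program: the function names, the Hann window (the memo does not say which window is used), the zero padding, and the direct DFT are all assumptions made for the example.

```python
import cmath
import math


def segment_start(samples, period_start, period_len):
    """Index of the first sample after the zero crossing that precedes
    the maximum excursion within one pitch period."""
    period = range(period_start, period_start + period_len)
    peak = max(period, key=lambda i: abs(samples[i]))
    i = peak
    # Walk backwards while the waveform keeps the sign it has at the peak.
    while i > 0 and samples[i - 1] * samples[peak] > 0:
        i -= 1
    return i


def log_spectrum(segment, n_fft=512):
    """Windowed, zero-padded DFT magnitude in db for bins 0..n_fft/2."""
    n = len(segment)
    # Hann window: an assumption, the memo does not name its window.
    padded = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
              for i, s in enumerate(segment)]
    padded += [0.0] * (n_fft - n)
    spectrum = []
    for k in range(n_fft // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n_fft)
                  for i, x in enumerate(padded))
        # The small constant only guards against log10(0).
        spectrum.append(20 * math.log10(abs(acc) + 1e-12))
    return spectrum


# Hypothetical waveform: the largest excursion is at index 4, and the
# preceding zero crossing lies between indices 1 and 2.
wave = [-1, -2, 1, 3, 5, 2, -1, -3, -1, 1, 2]
start = segment_start(wave, 0, 9)
```

The direct DFT is used only to keep the sketch free of outside libraries; any FFT routine would serve in its place.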
01600
01700 A study has been made of the effects of the segment location within
01800 the period and of the effect of the segment length. In general,
01900 cleaner transforms are produced when the segment length is something
02000 less than the full period; 80% seems to be a reasonable compromise
02100 between cleanness and unwarranted broadening of the peaks in the
02200 spectrum because of insufficient points of data. However, it is
02300 questioned whether this is a reasonable thing to do since the
02400 location of the formant peaks is affected by the glottal loading
02500 during the latter part of the period and this is, of course, removed.
02600 It seems more reasonable to assume that the speaker modifies the
02700 shape of his upper vocal tract to compensate for his own peculiar
02800 glottal loading effects since he attempts to produce sounds that
02900 match those produced by others and it is highly unlikely that the ear
03000 can do anything to disambiguate glottal coupling effects. It is
03100 observed that this glottal loading effect is more pronounced for
03200 pitch periods that happen to be longer than the average. For all
03300 appearances it seems that most speakers delay the closing of the
03400 glottis rather than lengthening the closed time when they drop the
03500 pitch of their voice. A reasonable thing to do thus seems to be to
03600 use the full period for intervals that are normal or shorter and to
03700 restrict the length to the average length for long periods.
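The length rule proposed above amounts to a single comparison. The sketch below is illustrative only; the function name is hypothetical, and how the average period length is obtained is not stated in the memo, so it is simply passed in.

```python
def analysis_length(period_len, average_len):
    """Number of samples of the current pitch period to transform."""
    # Normal or short periods are used in full; long periods are cut
    # back to the average length, which drops the late part of the
    # period where glottal loading is most pronounced.
    return min(period_len, average_len)


short = analysis_length(180, 200)   # shorter than average: used in full
long_ = analysis_length(260, 200)   # longer than average: cut to average
```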
03800
03900 The location of the formant peaks is also shifted somewhat by shifts
04000 in the starting point in the period since windowing attenuates
04100 contributions to the transform from the edge portions of the data but
04200 this effect is small as compared with the increase in ease with which
04300 the peaks can be located for the starting location as mentioned.
04400
04500 The first operation is to locate the largest proper peaks found in
04600 each of six regions, these being the usual ranges for the first five
04700 formants and the region below the usual lower limit for the first
04800 formant. These limits are shifted between male and female voices, but
04900 in general we have not found it necessary to adjust them for the
05000 specific speaker. A proper peak is defined as the largest local
05100 maximum in the region that is bounded on both sides by points that
05200 are of lesser amplitude. If the five points for the five formant
05300 regions are distinct, that is, no two are assigned the same value, the
05400 points are accepted as is, subject to a final medial smoothing
05500 operation which will be described later.
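A minimal sketch of the proper-peak search follows. The region limits are hypothetical bin numbers chosen for the example, not the limits used by the program, and it is assumed here that the two bounding points may lie just outside the region being searched.

```python
def largest_proper_peak(spectrum, lo, hi):
    """Bin of the largest local maximum in [lo, hi] whose two neighbors
    are both of lesser amplitude, or None if there is no such bin."""
    best = None
    first = max(lo, 1)
    last = min(hi, len(spectrum) - 2)
    for i in range(first, last + 1):
        if spectrum[i - 1] < spectrum[i] > spectrum[i + 1]:
            if best is None or spectrum[i] > spectrum[best]:
                best = i
    return best


# Hypothetical spectrum and overlapping regions, in bin numbers.
spectrum = [0, 1, 3, 2, 5, 4, 1, 2, 6, 3, 1]
regions = {"F1": (1, 4), "F2": (3, 6), "F3": (6, 9)}

picks = {name: largest_proper_peak(spectrum, lo, hi)
         for name, (lo, hi) in regions.items()}

# The picks are accepted as is only when no two formants share a bin.
found = [p for p in picks.values() if p is not None]
distinct = len(set(found)) == len(found)
```

In this example F1 and F2 both claim bin 4, which is the kind of conflict that arises when the regions overlap.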
05600
05700 Since the ranges for the formants overlap, frequent conflicts occur
05800 and these must now be resolved. This is done starting at the low
05900 frequency end. Somewhat different strategies are used for different
06000 possible conflicts.
06100
06200 Should the first and second formant identifications conflict, then
06300 searches are made for the next largest proper peaks, to the low
06400 frequency side extending the region to zero, and to the high
06500 frequency side to the upper limit of the F2 band. The amplitudes of
06600 these two new peaks and their positions with respect to median values
06700 for the F1 and F2 regions are then compared. Actually a decision made
06800 on the basis of amplitude only, allowing a 6 db credit for the higher
06900 frequency peak, seems to make the right decision almost always. A
07000 study will be made of this matter when a larger sample of data
07100 becomes available.
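The amplitude-only decision can be sketched as below. The memo does not spell out how the winning peak is assigned, so the assignment here is an assumption: if the higher-frequency candidate wins, the disputed peak is taken as F1 and the candidate as F2; otherwise the lower candidate is taken as F1 and the disputed peak as F2. The function names are hypothetical and amplitudes are assumed to be in db.

```python
def largest_proper_peak(spectrum, lo, hi):
    """Largest local maximum in [lo, hi] with lesser neighbors, or None."""
    best = None
    for i in range(max(lo, 1), min(hi, len(spectrum) - 2) + 1):
        if spectrum[i - 1] < spectrum[i] > spectrum[i + 1]:
            if best is None or spectrum[i] > spectrum[best]:
                best = i
    return best


def resolve_f1_f2(spectrum, disputed, f2_upper, credit_db=6.0):
    """Return (f1_bin, f2_bin), or None if no alternative peak exists."""
    # Low side: the search region is extended down to zero frequency.
    low = largest_proper_peak(spectrum, 0, disputed - 1)
    # High side: search up to the upper limit of the F2 band.
    high = largest_proper_peak(spectrum, disputed + 1, f2_upper)
    if low is None and high is None:
        return None
    if low is None:
        return disputed, high
    if high is None:
        return low, disputed
    # Amplitude-only decision with a credit for the higher peak.
    if spectrum[high] + credit_db >= spectrum[low]:
        return disputed, high
    return low, disputed


# Hypothetical db spectra; bin 3 is claimed by both F1 and F2.
strong_high = [10, 20, 15, 40, 30, 25, 28, 12, 5]
strong_low = [10, 40, 15, 45, 30, 25, 28, 12, 5]
case_a = resolve_f1_f2(strong_high, disputed=3, f2_upper=7)
case_b = resolve_f1_f2(strong_low, disputed=3, f2_upper=7)
```

In the first case the upper candidate (28 db plus the 6 db credit) beats the lower one (20 db); in the second the lower candidate (40 db) wins.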
07200
07300 Having resolved the conflict between F1 and F2, attention is then
07400 directed to a possible conflict between F2 and F3 which may have been
07500 introduced by the resolution of the F1 F2 conflict or which may have
07600 been there initially. If a conflict is newly introduced then a second
07700 look is given to the F1 F2 conflict. Recourse is now made to a
07800 procedure to locate a possible F2 peak that had been obscured by a
07900 dominant F1 peak. The approximate shape of the original F1-F2 peak is
08000 assumed to be parabolic, as determined from three data points: the
08100 point at the maximum and the points nearest the two three-db-down
08200 values. A fresh attempt is made to locate a new peak between the
08300 location of the disputed peak, which is now removed from the data,
08400 and the location previously found for F3. If such a peak is
08500 found it is assigned to F2 and attention is shifted to a possible
08600 F3-F4 conflict.
08700
08800 Should an initial conflict be found between F2 and F3, this is
08900 resolved in essentially the same way except that no attempt is made
09000 to find a possible hidden F3 as was done for F2. Instead, if a
09100 conflict between F4 and F5 is produced by the resolution of an F3-F4
09200 conflict then this is resolved just as if it were an initial
09300 conflict.
09400
09500
09600 Under certain circumstances it seems to be impossible to resolve all
09700 conflicts by the procedures just described. When this occurs the
09800 failure to locate a proper peak is signaled by storing a zero for
09900 the formant in question and the program proceeds to the next formant.
10000 On the completion of this first go-around a second look is given to
10100 any zero values, and finally, if still unresolved, the zeros are
10200 replaced by the value found for the formant in question in the
10300 previous time slot.
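The final fallback for unresolved formants can be sketched as below. The function name and the frequency values are hypothetical; zero is the failure marker described above.

```python
def fill_unresolved(current, previous):
    """Replace zero (unresolved) formant values with the values found
    for the same formants in the previous time slot."""
    return [prev if cur == 0 else cur
            for cur, prev in zip(current, previous)]


# Hypothetical formant frequencies in hertz for two adjacent slots.
previous_slot = [480, 1500, 2400, 3500, 4400]
current_slot = [500, 0, 2500, 0, 4500]
filled = fill_unresolved(current_slot, previous_slot)
```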
10400
10500 Having resolved all conflicts in this way, then the exact locations
10600 for peaks are refined by parabolic interpolations based on the
10700 positions of the highest point and its two nearest neighbors. It is
10800 doubtful if the greater precision which results from this operation
10900 is at all needed, at least in the case of 512 point transforms on
11000 20,000 hertz data. At least 2 bits of added precision can be obtained
11100 and the greatly improved smoothness of the resulting formant tracks
11200 seems to indicate that a corresponding increase in accuracy has
11300 resulted.
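For reference, a 512 point transform on 20,000 hertz data gives a bin spacing of 20000/512, or about 39 hertz. The sketch below uses the standard three-point parabolic interpolation formula; the function name and the test values are illustrative, and it is assumed that the interpolation is applied to the same amplitude scale on which the peak was located.

```python
def refine_peak(spectrum, k, sample_rate=20000.0, n_fft=512):
    """Frequency in hertz of the vertex of the parabola through the
    peak bin k and its two nearest neighbors."""
    a, b, c = spectrum[k - 1], spectrum[k], spectrum[k + 1]
    # Vertex offset from bin k, in bins; lies within -0.5..+0.5 when
    # bin k is a proper peak.
    offset = 0.5 * (a - c) / (a - 2.0 * b + c)
    return (k + offset) * sample_rate / n_fft


# Samples of a parabola whose true maximum lies at bin 5.25.
values = [-(x - 5.25) ** 2 for x in range(10)]
freq = refine_peak(values, 5)
bin_spacing = 20000.0 / 512
```

Here the refined location is 5.25 bins, a quarter of a bin above the highest sample.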
11400
11500 The procedures so far described result in very good formant tracks.
11600 However there are still isolated points which appear to be out of
11700 line. Most of these appear to be situations where a person would be
11800 quite unable to make an assured decision. A certain few can be traced
11900 to failures in the pitch period determining procedure while others
12000 are due to more obscure reasons. In almost all cases these
12100 abnormalities persist for but a single pitch period and they can be
12200 corrected by a final process of medial smoothing. This is done in one
12300 direction only: going forward in time, each value for each formant is
12400 replaced by the median value of the point in question, its
12500 predecessor (as already corrected) and its successor. Individual
12600 points which lie between their neighbors are not altered by this
12700 procedure. Errant points are replaced by values for the nearest
12800 neighbor. This procedure does have the effect of removing true
12900 extrema, but an extremum which persists for but a single pitch period
13000 probably does not contain much phonetic information and can probably
13100 be ignored. One could make allowances for true extrema by applying
13200 the medial smoothing only to points that lie more than, say, 2 db
13300 away from their nearest neighbor. This refinement seems entirely
13400 unnecessary but it is being kept in reserve.
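The one-directional medial smoothing can be sketched as follows for a single formant track. The function name and the track values are hypothetical; the optional refinement held in reserve is not included.

```python
def medial_smooth(track):
    """Forward-in-time three-point median smoothing of one formant
    track, using the already corrected predecessor."""
    out = list(track)
    for i in range(1, len(out) - 1):
        # Median of corrected predecessor, current point, and successor.
        out[i] = sorted((out[i - 1], track[i], track[i + 1]))[1]
    return out


# Hypothetical track in hertz with one errant point at index 2.
raw = [500, 510, 900, 520, 530]
smoothed = medial_smooth(raw)
```

The first and last points have no predecessor or successor and are left as they are in this sketch.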
13500
13600 The advantages of this method of formant extraction over other more
13700 conventional tracking procedures seem to lie in the much improved
13800 results in the vicinity of obstruents where the rapid changes in
13900 formant location can be masked by tracking and where information as
14000 to the nature of the obstruent is contained in this transition
14100 region.